HypoRiPPAtlas as an Atlas of hypothetical natural products for mass spectrometry database search

Recent analyses of public microbial genomes have found over a million biosynthetic gene clusters, the natural products of the majority of which remain unknown. Additionally, GNPS harbors billions of mass spectra of natural products without known structures and biosynthetic genes. We bridge the gap between large-scale genome mining and mass spectral datasets for natural product discovery by developing HypoRiPPAtlas, an Atlas of hypothetical natural product structures, which is ready-to-use for in silico database search of tandem mass spectra. HypoRiPPAtlas is constructed by mining genomes using seq2ripp, a machine-learning tool for the prediction of ribosomally synthesized and post-translationally modified peptides (RiPPs). In HypoRiPPAtlas, we identify RiPPs in microbes and plants. HypoRiPPAtlas could be extended to other natural product classes in the future by implementing corresponding biosynthetic logic. This study paves the way for large-scale explorations of biosynthetic pathways and chemical structures of microbial and plant RiPP classes.

This manuscript from the Mohimani lab describes HypoRiPPAtlas, which is an algorithm that links RiPP genome mining with the global natural products social molecular networking program. This is a novel and important concept, given that the latter houses a large amount of mass spectrometry data for natural products from various biological extracts. Based on a reading of the prior reviews, the authors have improved some aspects of the paper, but unfortunately, the paper is not ready for publication. Some rather egregious omissions and incorrect statements in the original manuscript appear to have been corrected, but some new text sections range from misleading to factually incorrect. Equally troubling is that the text itself is meandering, disorganized, and of unjustifiable length. The narrative is often confusing and some of the sections seem chronologically out of place. With respect to the figures, the presentation is below average. Based on several factors, most notably the large number of errors, the authors give the impression they do not know the RiPP field and haven't read the papers in their own citation list. A major rewrite is necessary before this manuscript should be published in any journal, let alone a multidisciplinary journal.
Page 3: To call GNPS a "gold mine" equates to hyperbole. In my experience, this database is not particularly well suited for RiPP discovery and the quality of the MS data is so highly variable that it often precludes usage.
Page 5, elsewhere: Why do the authors state in multiple locations that RODEO is only applicable to lasso peptides? Have the authors read the numerous papers (from their own citation list) that show RODEO being used for other RiPP classes? I have never heard anyone else say that RODEO only works for lasso peptides.
I don't follow the logic progression of including SynBNPs in this paragraph. This is a really big pivot. One idea per paragraph, please. The same is true throughout the paper. This is a criticism to help the authors produce a more readable text.
"non-class-specific"? Class-independent would be less confusing.
Odd to say that NRPs are less modified than RiPPs when some have equal or greater complexity.
Figure 3: the color scheme here is non-sensical and some of the fonts are bordering on microscopic. Why is an alternative core finder needed only if LanC or YcaO are in the gene cluster? This caption is not written in an accessible fashion.
Page 13: the sequence given for lanthi-1794 is clearly incorrect, since lanthipeptides require Cys. This is probably the sequence for lasso-1795. This paragraph bounces around between lasso and lanthi, the narrative hard to follow (illustrative of many places in the text).
Figure 4: how can you call this "discovery" of radamycin when this is a known compound? Further, throughout the manuscript, the lasso peptides are referred to as novel, but how can this be true when they are previously reported?
What is the URL for the hypoRiPPatlas server? I can see it listed in the GitHub but I don't believe it is listed in the manuscript.
It would be delightful if all of the chemical structure drawings were as nice as what is shown on pages 25, 28, etc of the SI pdf. Please redraw all of the others using this style.
Supp Table 6: this table is an utter mess. Many of the RiPPs require multiple enzymes for formation, and the authors only list one, often not the class-defining enzyme, so it's confusing from the start. Next, lanthipeptide A, B, etc. are not recognized RiPP classes. There are subclasses of lanthipeptides but they are numbered, not letters. The number of remaining errors is staggering.
Goadsporin is not a cyanobactin.
Cypemycin is not a lanthipeptide.
Thioviridamide is not a linaridin.
Hominicin is not a lanthipeptide.
There are probably more errors than this.
Reviewer #4 (Remarks to the Author):
The manuscript by Lee et al. addresses a bottleneck in natural product research: connecting the huge number of biosynthetic gene clusters detected by genome mining with the compounds they produce. "HypoRiPPAtlas: an Atlas of hypothetical natural products for mass spectrometry database search" introduces a pipeline that identifies RiPP precursor genes within genomes and predicts the putatively encoded structures and fragmentation patterns. These can subsequently be identified in actual MS spectra and databases. In this way, the authors were able to identify several novel RiPPs from bacteria and plants, showcasing the benefit of the pipeline. I believe that this manuscript is of high interest and high quality. After carefully reading the previous comments of the reviewers and the point-by-point answers, I believe that the manuscript is now ready for publication.

Dear Dr Bratovic
Thank you for forwarding us very insightful reviews. Below please find answers to all questions raised by the reviewers.
I attached the revised manuscript as well as the detailed point-by-point answers to the reviews. We significantly revised and extended the paper to address the reviewers' questions (most additions were delegated to the Supplementary material to limit the length of the main manuscript). The changes/additions are shown in red.
Please see our responses below.

Reviewer #1
The authors have carefully addressed all my comments from the previous round of review. I have no further comments or suggestions.

Reviewer #3
C3.1. This manuscript from the Mohimani lab describes HypoRiPPAtlas, which is an algorithm that links RiPP genome mining with the global natural products social molecular networking program. This is a novel and important concept, given that the latter houses a large amount of mass spectrometry data for natural products from various biological extracts. Based on a reading of the prior reviews, the authors have improved some aspects of the paper, but unfortunately, the paper is not ready for publication. Some rather egregious omissions and incorrect statements in the original manuscript appear to have been corrected, but some new text sections range from misleading to factually incorrect. Equally troubling is that the text itself is meandering, disorganized, and of unjustifiable length. The narrative is often confusing and some of the sections seem chronologically out of place. With respect to the figures, the presentation is below average. Based on several factors, most notably the large number of errors, the authors give the impression they do not know the RiPP field and haven't read the papers in their own citation list. A major rewrite is necessary before this manuscript should be published in any journal, let alone a multidisciplinary journal.
R3.1. Thank you for your suggestions. We have incorporated your recommendations into the manuscript. The revisions have been highlighted in red.

Major comments
C3.2. Page 3: To call GNPS a "gold mine" equates to hyperbole. In my experience, this database is not particularly well suited for RiPP discovery and the quality of the MS data is so highly variable that it often precludes usage.
R3.2. Thank you for the suggestion; we have removed the "gold mine" description and updated the paragraph as follows: Since 2015, the global natural product social (GNPS) molecular networking infrastructure has brought together over two thousand mass spectral datasets from over five hundred principal investigators, containing over seven hundred thousand samples obtained from microbial isolates and from host-oriented and environmental communities [14]. Accompanied by molecular networking [15] (a network of mass spectra, where similar spectra are connected by an edge), GNPS is a valuable resource for future natural product discovery. However, over 98% of the billion mass spectra currently stored at GNPS represent the 'dark matter of metabolomics' [16], since all attempts to interpret them have failed. This 'dark matter' likely consists of spectra of unknown molecules produced by BGCs encoded in existing genomic repositories.
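To make the parenthetical definition of molecular networking concrete, the following is a minimal sketch of how "similar spectra are connected by an edge". It is not the GNPS implementation: it uses a plain cosine score on binned peaks rather than the modified cosine that accounts for precursor mass differences, and the names `to_vector`, `build_network`, the 1 Da bin width, the 0.7 threshold, and the toy spectra are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def to_vector(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a sparse vector keyed by bin index."""
    vec = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def cosine(u, v):
    """Plain cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_network(spectra, min_cosine=0.7):
    """Return edges (a, b) for every pair of spectra scoring >= min_cosine."""
    vectors = {name: to_vector(peaks) for name, peaks in spectra.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= min_cosine
    ]


# Toy spectra (hypothetical): s1 and s2 share all peaks, s3 shares none.
spectra = {
    "s1": [(100.1, 50.0), (200.2, 100.0), (300.3, 25.0)],
    "s2": [(100.1, 45.0), (200.2, 90.0), (300.3, 30.0)],
    "s3": [(150.0, 80.0), (250.0, 60.0)],
}
edges = build_network(spectra)
```

In this toy case only `s1` and `s2` end up connected; a spectrum with no edge to any annotated spectrum is the kind of node the 'dark matter' remark refers to.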
C3.3. Page 5, elsewhere: Why do the authors state in multiple locations that RODEO is only applicable to lasso peptides? Have the authors read the numerous papers (from their own citation list) that show RODEO being used for other RiPP classes? I have never heard anyone else say that RODEO only works for lasso peptides.
R3.3. Thank you for bringing this up. In fact, the title of the original manuscript that reports RODEO indicates that it was developed specifically for lasso peptides:
C3.4. I don't follow the logic progression of including SynBNPs in this paragraph. This is a really big pivot. One idea per paragraph, please. The same is true throughout the paper. This is a criticism to help the authors produce a more readable text.
R3.4. Thank you for your suggestions. We removed the following sentences from the paragraph.
Synthetic-Bioinformatic Natural Products (syn-BNPs) [27] have emerged as an alternative to isolation-based natural product discovery approaches. Using syn-BNPs researchers no longer have to isolate bioactive molecules to study their activity. However, the syn-BNP approach has largely focused on non-ribosomal peptides as they are less widely modified than RiPPs. Furthermore, seq2RiPP can serve as an upstream step to the syn-BNP approach, as it provides high quality annotations and verifies theoretical RiPPs via mass spectral search.
Instead, we now mention the syn-BNP approach in the first paragraph of the Introduction as follows:
The natural products of cultured microbes have served as a major source of lead compounds for antibiotics [4], drug [5], food preservative [6], and analgesic agent [7, 8] discoveries. However, novel antibiotics are needed to combat antibiotic resistance, and a continued focus on the abundant molecules from cultured microbes is ineffective due to high rates of rediscovery. Traditional approaches rely on repeated fractionation and bioactivity testing, followed by isolation and structure elucidation of the molecules of interest, which is a time-consuming and expensive process. The Synthetic-Bioinformatic Natural Products (syn-BNP) approach [cite https://www.nature.com/articles/nchembio.2207], proposed as an alternative strategy, relies on predicting chemical structures with existing bioinformatic tools, and thus its effectiveness is constrained by the limitations of these tools.
C3.5. "non-class-specific"? Class-independent would be less confusing
R3.5. Thank you for the suggestion. We have replaced non-class-specific with class-independent in the following paragraph: RiPPER [20] builds upon RODEO outputs to predict class-independent RiPP precursors by adding ORF prediction via a custom build of gene prediction software Prodigal [21].
C3.6. Odd to say that NRPs are less modified than RiPPs when some have equal or greater complexity.
R3.6. Thank you for the suggestion. We removed the paragraph as shown in R3.4.
C3.7.1 Figure 3: the color scheme here is non-sensical and some of the fonts are bordering on microscopic. Why is an alternative core finder needed only if LanC or YcaO are in the gene cluster? This caption is not written in an accessible fashion.
R3.7 Thank you for your suggestions on Figure 3. We reply to each comment below.
Why is an alternative core finder needed only if LanC or YcaO are in the gene cluster?
This caption is not written in an accessible fashion.
We apologize for the typo in including LanC. We have rewritten the description in the figure caption. The alternative core finding is enabled for cyanobacteria BGCs (which contain the YcaO gene motif) and plants.
Figure 3: Bgc2orf and orf2core models. As illustrated in (a) and (b), from left to right, the red peptide is a RiPP ORF, and the yellow section is the RiPP core. The green blocks are two 1D CNNs, and the purple blocks are bidirectional LSTM with a dense layer and a CRF layer in bgc2orf and orf2core, respectively. (a) Bgc2orf model is a binary classifier that computes the probability of a given ORF peptide sequence being a RiPP ORF. Bgc2orf model consists of (1) a padding process and an embedding layer (shown in blue), (2) two 1D CNNs (shown in green), and (3) a single layer bidirectional LSTM, a flattening layer, and a dense layer (shown in purple). The output is a probability and the default cutoff is 0.5. (b) The orf2core model shares a similar architecture with bgc2orf. However, the flattening and dense layers are replaced with a conditional random fields layer (shown in purple), which predicts the probability of each amino acid being one of the < start >, < before >, < core >, < after >, < end > tokens. The orf2core model takes a RiPP ORF as input and identifies k N-terminal and k C-terminal cleavage sites given the predicted tokens, where k is a user-defined hyperparameter. N- and C-terminal cleavage sites are defined as the transition from < before > to < core > and from < core > to < after >, respectively. Then, cores are predicted based on the combination of N- and C-terminal cleavage sites. (c) An alternative core finder is used to search the repeated leader-follower patterns, which are highlighted in gray, and to identify the core sequence in the patterns, highlighted in yellow. The alternative core finding is enabled for cyanobacteria BGCs (which contain the YcaO gene motif) and plants.
Figure 3: Bgc2orf and orf2core models. As illustrated in (a) and (b), from left to right, the dark gray peptide demonstrates the process of filtering ORFs and identifying the core sequence, which is labeled in light gray. (a) Bgc2orf model is a binary classifier that computes the probability of a given ORF peptide sequence being a RiPP ORF. Moving from left to right, each ORF is assigned a probability, and those with probabilities higher than 0.5 pass the filter. Bgc2orf model consists of a padding process and an embedding layer, two 1D CNNs, a single layer bidirectional LSTM, a flattening layer, and a dense layer. (b) The orf2core model shares a similar architecture with bgc2orf. However, the flattening and dense layers are replaced with a conditional random fields layer, which predicts the probability of each amino acid being one of the < start >, < before >, < core >, < after >, < end > tokens. The orf2core model takes a RiPP ORF as input and identifies k N-terminal and k C-terminal cleavage sites given the predicted tokens, where k is a user-defined hyperparameter. N- and C-terminal cleavage sites are defined as the transition from < before > to < core > and from < core > to < after >, respectively. Then, cores are predicted based on the combination of N- and C-terminal cleavage sites. (c) An alternative core finder is used to search for repeated (at least twice) leader-follower patterns, which are highlighted in gray, and to identify the core sequences in the patterns, which are enclosed in boxes. The alternative core finding is enabled for cyanobacteria BGCs (which contain the YcaO gene motif) and plants.
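The post-processing described in parts (a) and (b) of the caption can be illustrated with a short sketch. This is not the seq2ripp code: the neural networks themselves are omitted, and the function names, the toy ORF, its tags, and the candidate site lists are hypothetical. The sketch only shows the 0.5 probability filter, how cleavage sites follow from the before-to-core and core-to-after transitions, and how k N-terminal and k C-terminal sites combine into candidate cores.

```python
from itertools import product


def filter_orfs(scored_orfs, cutoff=0.5):
    """Keep ORFs whose predicted RiPP probability is higher than the cutoff."""
    return [orf for orf, prob in scored_orfs if prob > cutoff]


def sites_from_tags(tags):
    """Locate cleavage sites from one per-residue tag sequence.

    Returns (n_site, c_site) such that the core is orf[n_site:c_site].
    A site is None if the corresponding transition never occurs.
    """
    n_site = c_site = None
    for i in range(1, len(tags)):
        if n_site is None and tags[i - 1] == "before" and tags[i] == "core":
            n_site = i  # first core residue
        if c_site is None and tags[i - 1] == "core" and tags[i] == "after":
            c_site = i  # first residue after the core
    return n_site, c_site


def candidate_cores(orf, n_sites, c_sites):
    """Combine every N-terminal site with every C-terminal site (k x k cores)."""
    return [orf[n:c] for n, c in product(n_sites, c_sites) if n < c]


# Hypothetical 14-residue ORF and tag sequence, for illustration only.
orf = "MKELAGCSTVCGKR"
tags = ["start"] + ["before"] * 4 + ["core"] * 6 + ["after"] * 2 + ["end"]

kept = filter_orfs([(orf, 0.93), ("MAAAK", 0.12)])
n_site, c_site = sites_from_tags(tags)
best_core = orf[n_site:c_site]

# With k = 2, assume the model also proposes one alternative site on each side.
cores = candidate_cores(orf, n_sites=[5, 6], c_sites=[11, 12])
```

With k = 2 the sketch yields four candidate cores; how the k alternative sites are ranked by the CRF is not specified in the caption, so the lists passed to `candidate_cores` are simply given by hand here.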
C3.8. Page 13: the sequence given for lanthi-1794 is clearly incorrect, since lanthipeptides require Cys. This is probably the sequence for lasso-1795. This paragraph bounces around between lasso and lanthi, the narrative hard to follow (illustrative of many places in the text).
R3.8. Thank you for pointing this out. We apologize for the incorrect sequence and for the subfigure labels being in the wrong order. The core sequence has been corrected, and lanthi-1794 is now in Figure 7c and lasso-1795 in Figure 7b. We also changed the order of the description and of Supplementary Figures 6 and 7, so that both lasso peptides are described first, followed by the lanthipeptide. In the main text, lasso-1795 now appears before lanthi-1794, as follows:
Lasso-1648 is identified from Streptomyces NRRL B-2660, containing an N-terminal macrolactam ring between N1 and D8 (Figure 7a and Supplementary Figure S5). Based on Seq2ripp predictions, the PTM is applied by Asn-synthetase (PF00733.18). Lasso-1795 is identified from Streptomyces NRRL B-2660 and WC-3560, containing an N-terminal macrolactam ring between Q1 and D8 (Figure 7b and Supplementary Figure S6). Lanthi-1794 is identified from Streptomyces WC-3904. A dehydroalanine (Dha) located at S6, and three dehydrobutyrines (Dhb) located at T2, T10, and T15 are potentially connected to one of the cysteines in the core peptide, forming lanthionine (Lan) or methyl-lanthionine (MeLan) rings (Figure 7c and Supplementary Figure S7). The PTM is applied by Lantibiotic dehydratase (PF05147.6 and PF04738.6).
C3.9.1 Figure 4: how can you call this "discovery" of radamycin when this is a known compound?
R3.9.1 Thank you for pointing this out. We apologize for this typo, and we have fixed it in the Figure 4 caption: Figure 4: Rediscovery of radamycin by seq2ripp.
C3.9.2 Further, throughout the manuscript, the lasso peptides are referred to as novel, but how can this be true when they are previously reported?
R3.9.2 We thank the reviewer for bringing this up. We modified the sentence as follows: The DTGHCSGVCTVLVCTVAVC core identified by seq2ripp for lanthi-1794 does not appear in the survey conducted by Walker et al. [47]. However, the precursors and cores for lasso-1648 and lasso-1795 were previously reported by RODEO [19]. Seq2ripp validated, through mass spectral search, that these lasso peptides are expressed naturally by the producing microorganisms.
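For readers checking the lanthi-1794 assignment, the modifications quoted in R3.8 can be translated into the mass shift that a precursor-mass match would rely on. The sketch below is not taken from seq2ripp; the helper names are hypothetical, and it assumes only standard monoisotopic residue masses, a loss of one water per Ser-to-Dha or Thr-to-Dhb dehydration, and that the subsequent thioether (Lan/MeLan) ring closure leaves the mass unchanged.

```python
# Standard monoisotopic residue masses (Da) for the 20 proteinogenic amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O (Da)


def peptide_mass(seq):
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def dehydrated_mass(seq, positions):
    """Mass after dehydrating Ser/Thr at the given 1-based positions.

    Each dehydration (Ser -> Dha, Thr -> Dhb) removes one water. Ring
    closure onto Cys is treated as mass-neutral, so it is not modeled.
    """
    for pos in positions:
        if seq[pos - 1] not in "ST":
            raise ValueError(f"position {pos} is {seq[pos - 1]}, not Ser/Thr")
    return peptide_mass(seq) - len(positions) * WATER


core = "DTGHCSGVCTVLVCTVAVC"  # lanthi-1794 core quoted in R3.9.2
unmodified = peptide_mass(core)
modified = dehydrated_mass(core, [2, 6, 10, 15])  # T2, S6, T10, T15
shift = modified - unmodified
```

The four dehydrations lower the mass by four waters relative to the unmodified core. The position check also confirms that the quoted sites fall on Ser or Thr in the corrected sequence, and that the core contains four Cys residues available for ring formation.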
C3.10. Figure 7c/d: I do not see the transporter genes that are indicated in the legend.
R3.10. We have removed transport-related genes from the Figure 7d legend. We do not see transporter-related genes in the hypothetical BGC in Figure 7.
C3.11. What is the URL for the hypoRiPPatlas server? I can see it listed in the GitHub but I don't believe it is listed in the manuscript.
R3.11. Thank you for pointing this out. We have added the link to the Data and code availability section.
Instructions for using HypoRiPPAtlas are available from https://github.com/mohimanilab/seq2ripp.
C3.12. Supp Fig 6: Caption says lasso peptide and lanthipeptide. Seems like an error.
R3.12. Thank you for pointing this out. We apologize for the typo and have fixed it. Also, corresponding to R3.8, we changed Supplementary Figure
C3.14. It would be delightful if all of the chemical structure drawings were as nice as what is shown on pages 25, 28, etc of the SI pdf. Please redraw all of the others using this style.
R3.14. Thank you for your suggestion. We have remade all chemical structures as shown below: